Abstract: Representation learning and unsupervised learning are two central topics of machine learning and signal processing. Deep
learning is one of the most effective unsupervised representation learning approaches. The main contributions of this paper to these
topics are as follows. (i) We propose to view the representative deep learning approaches as special cases of the knowledge reuse
framework of clustering ensemble. (ii) We propose to view sparse coding, when used as a feature encoder, as the consensus function of
clustering ensemble, and to view dictionary learning as the training process of the base clusterings of clustering ensemble. (iii) Based on
the above two views, we propose a very simple deep learning algorithm, named deep random model ensemble (DRME). It is a stack of
random model ensembles. Each random model ensemble is a special k-means ensemble that discards the expectation-maximization
optimization of each base k-means and preserves only the default initialization method of the base k-means. (iv) We propose to select
the most powerful representation among the layers by applying DRME to clustering, where single-linkage is used as the clustering
algorithm. Moreover, the DRME based clustering can also detect the number of natural clusters accurately. Extensive experimental
comparisons with 5 representation learning methods on 19 benchmark data sets demonstrate the effectiveness of DRME.

Index Terms: Clustering, deep learning, dictionary learning, ensemble learning, sparse coding, unsupervised representation learning.
1 Introduction

Representation learning aims to learn transformations of the data
that make it easier to extract useful information when building
classifiers or other predictors [1]. Popular representation learning
techniques include ICA in source separation, PCA in dimension
reduction, kernel learning in classification, and Bayesian
nonparametric models in data modelling. As argued by Hinton et al.
[2], these methods are all shallow models that learn linear
transformations or only one layer of nonlinear transformations, so
that (i) their representational power is limited, (ii) the number of
their parameters grows rapidly with the size of the data set, or
(iii) they have both of the aforementioned weaknesses. Therefore,
deep models, which contain multiple layers of nonlinear
transformations, have been suggested as one of the recent advances.
The main advantage of deep models over shallow ones is that
"functions that can be compactly represented by a depth k
architecture might require an exponential number of computational
elements to be represented by a depth k - 1 architecture" [3].

The main difficulty of deep models is that multiple layers of
nonlinear transformations make the models suffer severely from bad
local minima. In 2006, a breakthrough in training deep models was
made by Hinton et al. [2], followed by revolutionary improvements in
image processing and speech recognition [4], [5].

Currently, the training method of deep belief networks (DBN) in [2]
has become a standard one. It consists of two phases: the
unsupervised greedy layer-wise pre-training phase and the supervised
fine-tuning phase. The pre-training phase is the key idea of deep
learning that helps deep models escape bad local minima. It is also
an active research area. The pre-training phase trains a stack of
shallow modules successively, where the input of each module is the
output of its ancestor (i.e. previous) module. Representative
shallow modules include the restricted Boltzmann machine (RBM) [2]
and the denoising autoencoder (DAE) [6], [7]. See [1] for an
excellent review.

However, deep learning is still far from fully explored and
understood. In this paper, we pay attention to the following two key
aspects. First, current discussions on deep learning are still
limited to probabilistic graphical models and neural networks; other
meaningful interpretations and successful building blocks are seldom
seen. Second, existing deep models are still too complicated for a
wide range of applications. As is well known, a widely used method
should be simple and fast, such as k-means clustering. There is also
a current trend of simplifying state-of-the-art modeling techniques
for efficiency, such as using k-means to learn feature
representations [8] and discussing the relationship between the
Dirichlet process and k-means [9], [10]. With respect to these two
aspects, we focus on the following two problems:

• How can we understand the success of the unsupervised
pre-training of deep learning, so as to guide the design of new
building blocks?

• Can we get a very simple deep learning method that a freshman can
play with?
For problem 1, (i) we view the existing successful building blocks
[2], [6], [7] from the perspective of ensemble learning, and view the
unsupervised layer-wise pre-training phase as a knowledge reuse
framework of clustering ensemble, so that a vast number of ensemble
learning techniques become available for designing new building
blocks. Ensemble learning [11] is an important branch of machine
learning that aims to combine a series of base learners into a
stronger one. Clustering ensemble is the unsupervised extension of
ensemble learning [12]-[15]. The success of ensemble learning is
supported by two basic criteria: meaningful base learners and strong
diversity among the base learners. See Section 4.1 for a further
introduction. (ii) We further pay particular attention to the
important experimental phenomena on sparse coding reported in [16],
where sparse coding is an important shallow representation learning
approach. If we view sparse coding (when used as a feature encoder)
as the consensus function of clustering ensemble, and if we view
dictionary learning as the training process of clustering ensemble,
we can easily explain the success of the very simple sparse coding
methods in [16].

For problem 2, we first use the k-means ensemble [13] as the building
block of a deep architecture. Because the k-means ensemble is very
inefficient, we discard the expectation-maximization (EM)
optimization of k-means and preserve only the default initialization
method of its k centers, namely random observation sampling. The
proposed DRME simplifies deep learning to such an extent that it
needs neither an explicit optimization objective nor any
sophisticated optimization algorithm. Although the proposed
algorithm is so simple, it performs surprisingly well in practice,
as in our application to clustering.

The decision to discard the EM optimization is motivated step by
step as follows. (i) After viewing the building blocks of the
representative deep learning approaches as special cases of
clustering ensemble, we take the two basic criteria of clustering
ensemble as our design criteria. One key criterion is how to train a
meaningful base clustering. (ii) Considering how successfully
contrastive-divergence (CD) training [17], [18] approximates maximum
likelihood training for DBN, we find that even when the maximum
iteration number of the EM training is reduced gradually from a
large number to zero, the randomly sampled k centers can still form
a meaningful base clustering. (iii) After explaining the
experimental phenomena of sparse coding in [16] from the perspective
of clustering ensemble and further building a relationship between
the work in [16] and the proposed DRME, we find strong empirical
support for the proposed DRME in the literature.

The main contributions are summarized as follows. 

• We view the representative deep learning approaches [2], [6], [7]
as special cases of the knowledge reuse framework of clustering
ensemble [12].

• We explain the success of the simple sparse coding approaches,
when used as feature encoders in [16], from the perspective of
clustering ensemble.

• We propose a very simple and fast deep learning algorithm, called
DRME.

• We propose a new scheme for finding the most powerful
representation among the layers by applying DRME to clustering. As a
by-product, the DRME based clustering can also detect the number of
natural clusters automatically [12], [13], [19], which is a
well-known hard problem in clustering.

• We conduct an extensive experimental comparison with 5
state-of-the-art unsupervised representation learning algorithms on
19 benchmark data sets.

The remainder of this paper is organized as follows. In Section 2,
we present the proposed DRME algorithm directly, so as to give the
reader a first impression of how simple the algorithm is. In Section
3, we apply DRME to clustering. In Section 4, we review three
related topics, namely clustering ensemble, deep learning, and
sparse coding, to prepare the discussion of our motivation in
Section 5. In Section 5, we first present how we view the popular
deep learning methods as stacked clustering ensembles, and then
explain why we can reduce the clustering ensemble to the random
model ensemble. In Section 6, we conduct an extensive experimental
comparison, where the performance is evaluated by clustering
accuracy and running time. Finally, in Section 7, we conclude this
paper.

We first introduce some notation. Bold lowercase letters, e.g.
$\mathbf{w}$ and $\mathbf{a}$, indicate column vectors. Bold capital
letters, e.g. $\mathbf{W}$ and $\mathbf{K}$, indicate matrices.
Letters in calligraphic bold fonts, e.g. $\mathcal{A}$,
$\mathcal{B}$, and $\mathbb{R}$, indicate sets, where $\mathbb{R}^{d}$
denotes a d-dimensional real space. The operator $\|\cdot\|_{m}$
denotes the m-norm, where m is a constant.

2 Deep Random Model Ensemble 

In this section, we will first review the key idea of deep 
learning. Then, we will present the deep random model 
ensemble and analyze its time and space complexities. At 
last, we will illustrate the effectiveness of the proposed 
method on a handwritten digit recognition problem. 

2.1 Preliminary 

For unsupervised representation learning, we are interested in
learning a mapping $f_{\theta}$ from the input
$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ to a new
representation $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_n]$,
i.e. $\mathbf{y}_i = f_{\theta}(\mathbf{x}_i)$, $\forall i = 1, \ldots, n$,
where $\mathbf{x}_i$ and $\mathbf{y}_i$ are real-valued column vectors
whose dimensions may differ, and $\theta$ is the parameter of the
mapping function.

For a deep architecture with L layers, we aim to learn
$\mathbf{Y}$ through L mapping functions $\{f_{\theta_l}\}_{l=1}^{L}$,
i.e. $\mathbf{y}_i = f_{\theta_L}(f_{\theta_{L-1}}(\cdots f_{\theta_1}(\mathbf{x}_i)))$,
$\forall i = 1, \ldots, n$.

Fig. 1. Diagram of the architecture of the proposed deep random
model ensemble.

For the unsupervised layer-wise training of a deep architecture, we
train each mapping function independently, with the input of the
mapping being the output of its ancestor mapping function. This can
be formulated as a problem of learning the following functions
successively:

$$
\begin{aligned}
\mathbf{y}_i^{(1)} &= f_{\theta_1}(\mathbf{x}_i), \\
\mathbf{y}_i^{(2)} &= f_{\theta_2}(\mathbf{y}_i^{(1)}), \\
&\;\;\vdots \\
\mathbf{y}_i^{(L)} &= f_{\theta_L}(\mathbf{y}_i^{(L-1)}),
\end{aligned}
\qquad \forall i = 1, \ldots, n \tag{1}
$$

where $\theta_l$ and $\mathbf{y}^{(l)}$ are the parameter and output
representation of the l-th layer respectively, with $l = 1, \ldots, L$
and $\mathbf{y}^{(l)} \in \mathbb{R}^{d^{(l)}}$.
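The successive scheme of Eq. (1) amounts to a plain loop in which each layer is fitted on the output of the previous one. The following is a minimal Python sketch of that loop, not the authors' implementation. The names `train_layerwise`, `fit_layer`, and `fit_centering_tanh` are hypothetical, the toy layer is only a stand-in, and observations are stored as rows for convenience (the paper stores them as columns).

```python
import numpy as np

def train_layerwise(X, fit_layer, num_layers):
    """Greedy layer-wise training in the sense of Eq. (1).

    X          : (n, d) array, one observation per row.
    fit_layer  : hypothetical callable that fits one layer on its input
                 and returns the fitted mapping as a function.
    Returns the list of fitted mappings and the output of every layer.
    """
    mappings, outputs = [], []
    Y = X
    for _ in range(num_layers):
        f = fit_layer(Y)   # train layer l on the output of layer l-1
        Y = f(Y)           # y^(l) = f_theta_l(y^(l-1))
        mappings.append(f)
        outputs.append(Y)
    return mappings, outputs


def fit_centering_tanh(Y):
    """Toy stand-in layer: remove the training mean, then apply tanh."""
    mean = Y.mean(axis=0, keepdims=True)
    return lambda A: np.tanh(A - mean)
```

Any layer that exposes the same "fit, then return a mapping" interface can be plugged in, which is the property the layer-wise view relies on.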

2.2 Algorithm Description 

The key idea of the proposed DRME is to stack multiple random model
ensembles, where the random model ensemble is a reduced k-means
ensemble [13] that preserves the default initialization method (i.e.
random observation sampling) of the k centers of each base k-means
clustering but discards the EM optimization of the base k-means. We
present the DRME algorithm in detail as follows, with a schematic
diagram of the deep architecture shown in Fig. 1.

The key step in developing a deep learning algorithm is to design
$f_{\theta_l}$ so as to balance the effectiveness and efficiency of
the algorithm. See Section 4 and Ref. [1] for reviews. DRME is also
a stack of building blocks $\{f_{\theta_l}\}_{l=1}^{L}$. But unlike
the existing deep learning algorithms, the l-th building block
$f_{\theta_l}$ is an ensemble of random models, denoted as
$\{g_v^{(l)}\}_{v=1}^{V}$.
Each random model $g_v^{(l)}$ consists of the following two phases:

• Random observation sampling and nonlinear transform. The parameter
of $g_v^{(l)}$ is k randomly selected observations from
$\mathbf{Y}^{(l-1)}$, denoted as
$\mathbf{M}_v^{(l)} = [\mathbf{m}_{v,1}^{(l)}, \ldots, \mathbf{m}_{v,k}^{(l)}]$,
where k is a positive integer that is randomly chosen from a given
range, denoted as $[k_{\min}, k_{\max}]$ with $k_{\max} > k_{\min} > 2$.
As in k-means clustering, we regard the selected k observations as
the k centers of the random model $g_v^{(l)}$, so that the
observation $\mathbf{y}_i^{(l-1)}$, $\forall i = 1, \ldots, n$, is
assigned to the center that has the shortest distance to it among
the k centers. In this paper, the Euclidean distance is used as the
metric, so that the index of the predicted center is given by:

$$
\arg\min_{j = 1, \ldots, k} \left\| \mathbf{m}_{v,j}^{(l)} - \mathbf{y}_i^{(l-1)} \right\|_2^2, \quad \forall i = 1, \ldots, n. \tag{2}
$$

Note that (i) the prediction via the Euclidean distance is regarded
as a nonlinear transform of the input, and (ii) different random
models have different k.

• Sparse coding. Suppose $\mathbf{y}_i^{(l-1)}$ is assigned to the
j-th center. We encode this prediction as a k-dimensional indicator
vector $\mathbf{y}_{i,v}^{(l)} = [y_{i,v,1}^{(l)}, \ldots, y_{i,v,k}^{(l)}]^T$,
which takes 1 for the j-th element and 0 for the others. This 1-of-k
coding method is a common strategy in multiclass problems, such as
k-means.

After getting the outputs of the i-th observation from all random
models $\{g_v^{(l)}\}_{v=1}^{V}$, i.e.
$\{\mathbf{y}_{i,v}^{(l)}\}_{v=1}^{V}$, we concatenate these sparse
vectors into a long one:

$$
\mathbf{y}_i^{(l)} = \left[ (\mathbf{y}_{i,1}^{(l)})^T, \ldots, (\mathbf{y}_{i,V}^{(l)})^T \right]^T. \tag{3}
$$

Finally, we get the l-th random model ensemble
$\{g_v^{(l)}\}_{v=1}^{V}$ and the l-th feature representation
$\mathbf{Y}^{(l)} = [\mathbf{y}_1^{(l)}, \ldots, \mathbf{y}_n^{(l)}]$
for the (l + 1)-th layer.
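The two phases above (random observation sampling, then 1-of-k coding and concatenation) can be sketched for a single layer as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `random_model_ensemble` is hypothetical, observations are stored as rows instead of columns, the k centers are drawn without replacement, and k is drawn uniformly from the given range.

```python
import numpy as np

def random_model_ensemble(Y_prev, V, k_min, k_max, rng):
    """One DRME layer built from V random models.

    Y_prev : (n, d) array with one observation per row.
    Returns an (n, total_k) 0/1 matrix, where total_k is the sum of k over
    the V models; every row contains exactly V ones (one per model).
    """
    n = Y_prev.shape[0]
    sq_norms = (Y_prev ** 2).sum(axis=1)
    codes = []
    for _ in range(V):
        # Assumption: k is uniform on the integers in [k_min, k_max].
        k = int(rng.integers(k_min, k_max + 1))
        # Random observation sampling (assumed to be without replacement).
        idx = rng.choice(n, size=k, replace=False)
        centers = Y_prev[idx]
        # Squared Euclidean distances to the k centers, as in Eq. (2).
        d2 = sq_norms[:, None] - 2.0 * Y_prev @ centers.T + sq_norms[idx][None, :]
        nearest = d2.argmin(axis=1)
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), nearest] = 1.0   # 1-of-k coding
        codes.append(one_hot)
    return np.hstack(codes)                    # concatenation, Eq. (3)
```

Note that no iterative optimization appears anywhere in the sketch: fitting a layer only draws indices, and encoding only needs one distance computation per random model.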

When the dimension of the sparse representation $\mathbf{Y}^{(l)}$
is much larger than the size of the data set, i.e. $d^{(l)} \gg n$,
we may take the similarity matrix $\mathbf{Z}^{(l)}$ as the input of
the (l + 1)-th layer instead of $\mathbf{Y}^{(l)}$, which is the
scheme we have adopted in all experiments of this paper. Given the
sparse representation $\mathbf{Y}$, the similarity matrix
$\mathbf{Z}$ is calculated by:

$$
\mathbf{Z} = \frac{1}{V} \mathbf{Y}^T \mathbf{Y} \tag{4}
$$

where $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_n]$ with
$\mathbf{z}_i \in [0, 1]^n$, $\forall i = 1, \ldots, n$. The
similarity matrix has been adopted in many well-known algorithms,
such as [12, Section 3.2] and [13, Section 3.2]. Note that the
n x n similarity matrix might be further compressed by the Nyström
method [20].
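As a small check of Eq. (4), the sketch below computes the similarity matrix for a toy code matrix. It assumes row-wise storage (one observation per row), so the product is written `codes @ codes.T` rather than Y^T Y; the function name and the toy data are illustrative only.

```python
import numpy as np

def similarity_matrix(codes, V):
    """Eq. (4) for row-wise codes: entry (i, j) is the fraction of the V
    random models that assign observations i and j to the same center."""
    return codes @ codes.T / V

# Toy example: 3 observations, V = 2 random models with k = 2 centers each.
codes = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
])
Z = similarity_matrix(codes, V=2)
```

Because every observation activates exactly one entry per random model, the diagonal of Z is 1 and all entries lie in [0, 1], which matches the range stated after Eq. (4).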

Here, we make three remarks. (i) The random observation sampling is
very important to the success of DRME. It yields a "meaningful" base
learner that is slightly better than a random guess. See Sections
4.1 and 5.3 for further discussion. (ii) The 1-of-k coding method
can be seen as one of the simplest sparse coding methods [21]; see
Sections 4.3 and 5.2 and [16] for further discussion. (iii) We may
further improve the performance by enlarging the diversity among the
random models via random feature selection. This topic is beyond the
scope of this paper.

2.3 Complexity Analysis 

To facilitate the analysis, we do not consider the differences
between the layers. We use the following notation. The depth of DRME
is L. For each layer, the random model ensemble consists of V base
clusterings; the average number of output clusters of each base
clustering is k; the input feature of each layer is a d x n matrix
with each column representing an observation; the sparsity of the
input is s, where the matrix sparsity is defined as the ratio of the
number of non-zero entries to the size of the matrix.

2.3.1 Computational Complexity

It is easy to see that each base clustering g has a computational
complexity of $\mathcal{O}(dnsk)$. Hence, the computational
complexity of DRME is $\mathcal{O}(dnskVL)$. Because the following
relation holds:

$$
d = Vk \tag{5}
$$

we can conclude that the computational complexity of DRME is about
$\mathcal{O}(nsk^2V^2L)$.

If we take the similarity matrix in Eq. (4) as the input of each
layer, the complexity of each base clustering is about
$\mathcal{O}(n^2sk)$. Calculating the similarity matrix needs an
additional complexity of about $\mathcal{O}(n^2ds^2)$. Therefore,
the computational complexity of DRME is about
$\mathcal{O}(n^2skVL + n^2ds^2L)$. Substituting Eq. (5) into this
expression gives $\mathcal{O}(n^2skVL + n^2s^2kVL)$.

In practice, we usually set both k and V to large values, e.g.
$k \approx 50$ and $V = 2000$, so as to guarantee the robustness of
the performance. Hence, it is easy to observe that taking the sparse
representation as the input of each layer is suitable for
large-scale problems, while taking the similarity matrix as the
input of each layer, which is the case in this paper, is suitable
for small-scale problems.
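As a worked check of the two substitutions, using only Eq. (5) and values quoted in the text (V = 2000 and k around 50 from this subsection, n = 5620 for the Optdigits data set of Section 2.4):

```latex
\begin{align*}
\mathcal{O}(dnskVL) &= \mathcal{O}\big((Vk)\,nskVL\big) = \mathcal{O}(nsk^{2}V^{2}L), \\
\mathcal{O}(n^{2}skVL + n^{2}ds^{2}L) &= \mathcal{O}\big(n^{2}skVL + n^{2}(Vk)s^{2}L\big)
  = \mathcal{O}(n^{2}skVL + n^{2}s^{2}kVL), \\
d = Vk &\approx 2000 \times 50 = 10^{5} \gg n = 5620 .
\end{align*}
```

For a data set of this size the sparse representation therefore has far more dimensions than observations, which is the regime in which the n x n similarity matrix is taken as the layer input.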

2.3.2 Storage Complexity 

For the l-th layer, we need to store its whole input and output,
which requires $\mathcal{O}(2dns)$ space. We also need to store the
centers of its V random models, which requires $\mathcal{O}(kVd)$
space. Summing the two items gives $\mathcal{O}(2dns + kVd)$.
Substituting Eq. (5) into the sum leads to the conclusion that the
storage complexity of DRME is $\mathcal{O}(2nskV + k^2V^2)$. Because
k and V do not have a direct relationship with n, the overall
storage complexity is linear with respect to the size of the data
set.

For small-scale problems, if we take the similarity matrix as the
input of each layer, we need an additional storage complexity of
$\mathcal{O}(n^2)$ for the similarity matrix, which is the case in
this paper.

2.4 Effectiveness of DRME: A Visualized Example 

Because it is assumed in machine learning that the observations in a
high-dimensional space are generated by very few independent factors
that lie in a low-dimensional subspace, the effectiveness of the
learned representation is judged by whether the observations that
come from different classes can be well separated in a
low-dimensional embedding subspace. Hence, if we extract the
low-dimensional information from the learned feature representations
by e.g. PCA, a good representation should yield the following clear
pattern: the observations from the same factor are concentrated,
while the observations from different factors are well separated.

Fig. 2. Visualizations of different feature representations of the
Optdigits data set. (a) Original features. (b)-(f) Learned feature
representations by DRME at different layers (i.e. depths). The
images are obtained via PCA. The images in the dashed boxes of Figs.
2c, 2d, 2e, and 2f are further amplified in Figs. 3c, 3d, 3e, and
3f, respectively.

TABLE 1
Parameter settings of the deep random model ensemble.

Description of the parameter                    Notation    Value
Depth of DRME                                   L           10
Number of random models per layer               V           2000
Minimum number of clusters per random model     k_min       10
Maximum number of clusters per random model     k_max       100

In this subsection, we run DRME on the optical recognition of
handwritten digits (Optdigits) data set.¹ The Optdigits data set is
a widely used benchmark data set in the UCI machine learning
repository. It contains 10 handwritten integer digits ranging from 0
to 9. It consists of 5620 observations and 64 attributes (i.e.
dimensions). Each digit consists of about 560 observations. The
parameter settings of DRME are summarized in Table 1. To visualize
the learned representations, we project them to a 2-dimensional
subspace by PCA.

1. http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Fig. 3. Amplifications of the images in the dashed boxes of Fig. 2.
Figs. (c), (d), (e), and (f) are the amplifications of Figs. 2c, 2d,
2e, and 2f, respectively. Recoverable panel titles: (c) Amplification
of layer 3; (d) Amplification of layer 5; (f) Amplification of layer
9.

The result is shown in Fig. 2. The images in the dashed boxes of
Figs. 2c, 2d, 2e, and 2f are amplified in Figs. 3c, 3d, 3e, and 3f,
respectively. From the figures, we can see clearly that as the depth
of DRME increases, the observations from the same digit become more
and more concentrated while the observations from different digits
become more and more separated, which fully meets our expectation.
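The 2-dimensional PCA projection used for these visualizations can be sketched as below. This is a generic PCA via the singular value decomposition, given as an assumption about how such a projection is typically computed rather than as the authors' code; the function name `pca_project` is hypothetical and observations are stored as rows.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project row-wise observations onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Applying the same projection routine to the original features and to the output of each layer gives scatter plots that are directly comparable across depths.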

3 Deep Random Model Ensemble for 
Clustering 

In this section, we will first present the importance of 
applying DRME to clustering. Then, we will present 
the DRME based clustering in detail. At last, we will 
illustrate the effectiveness of the DRME based clustering 
with one visualized example. 

3.1 Motivation 

Applying DRME to clustering has two important goals: 
(i) detecting the most powerful representation among the 
layers, and (ii) detecting the natural clusters. The first 
goal is the main objective of this application, while the 
second one can be regarded as an important by-product 
of the application. 

3.1.1 Why Do We Use Clustering to Detect the Most 
Powerful Representation? 

When the depth is chosen properly, the learned representation is
close to the underlying smooth manifold. However, when DRME is not
deep enough, we may not get a smooth feature representation, which
is known as under-fitting. Also, when the data set is not
large-scale, we run the risk of over-fitting. Hence, it is important
to decide which representation we should pick among the layers.

If we have no knowledge about the data, clustering seems to be the
only way to solve this problem. Moreover, compared to supervised
learning, the performance of clustering is much more sensitive to
the shape of the representation; hence, using clustering to detect
the changes of the representations among the layers is better than
using supervised learning.

3.1.2 Why Is This Application Important to Clustering?

Clustering is the process of partitioning a set of data observations
into multiple clusters so that the observations within a cluster are
similar, and the observations in different clusters are very
dissimilar [22]. Data representation is the core problem of
clustering. Specifically, as summarized in [23, Section 3], data
clustering has four challenges: (i) data representation, (ii)
purpose of grouping, (iii) number of clusters, and (iv) cluster
validity. Among these challenges, data representation is the basis
of the other three. First, different purposes of grouping need
different data representations. Second, as shown in [23, Fig. 5] and
Fig. 2, a good representation that reflects the essential factors
can result in a clear data structure, so that a simple clustering
algorithm can reach a valid partition.

3.2 DRME Based Clustering 

Any clustering algorithm that is able to yield a variable number of
clusters can be combined with DRME. Generally, clustering algorithms
can be divided into two groups: partitional and hierarchical; see
[23], [24] for excellent reviews. Although some partitional
algorithms can yield a variable number of clusters, such as
Dirichlet process mixture models [25] or support vector clustering
[26], in this paper we prefer the representative hierarchical
clustering methods, such as single-linkage or complete-linkage,
since they are simple, fast, and need no parameter tuning.

In this paper, single-linkage clustering is used. It is an
agglomerative hierarchical clustering method. Specifically, it
builds a hierarchical tree on the data. Each merging of two leaves
(i.e. clusters) generates a new partition of the data. If we record
the distances between the merged leaves, the tree can be represented
as a vector with n - 1 elements (i.e. distance records), denoted as
$\mathbf{p} = [p_1, \ldots, p_{n-1}]^T$, with $p_1$ and $p_{n-1}$
being the last and first mergings respectively. Note that
$\mathbf{p}$ is in descending order, with $p_1$ as its largest
value.

As in [13, Section 3.3], the number of clusters is selected as the
one that yields the longest cluster lifetime [13, Fig. 3]:

$$
k^{*} = \arg\max_{k = 2, \ldots, n-1} \left( p_{k-1} - p_{k} \right). \tag{6}
$$
Fig. 4. Dendrograms produced by the single-linkage on the Optdigits
data set. k in the title of each dendrogram is the number of the
detected clusters. Recoverable panel titles: (b) Layer 2 (k=2); (c)
Layer 3 (k=2).

However, because a manually-defined class might have several natural
clusters, selecting a good representation according to the number of
clusters alone might be too arbitrary. Moreover, it is not robust: a
slight disturbance of the representation might yield a very
different $k^{*}$, so that it is hard to design a simple criterion
based on the longest cluster lifetime for the representation
selection problem.
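The longest-lifetime rule of Eq. (6) can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' code: instead of building the dendrogram explicitly, it uses the known equivalence between single-linkage merge distances and the edge weights of a minimum spanning tree (computed here with Prim's algorithm), and both function names are hypothetical.

```python
import numpy as np

def single_linkage_merge_distances(X):
    """Return the n-1 single-linkage merge distances in descending order.

    Single-linkage merge heights equal the edge weights of a minimum
    spanning tree of the complete Euclidean graph (Prim's algorithm).
    """
    n = X.shape[0]
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()          # cheapest link from the tree to each point
    weights = []
    for _ in range(n - 1):
        best[in_tree] = np.inf     # never re-select points already in the tree
        j = int(best.argmin())
        weights.append(best[j])
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return np.sort(np.array(weights))[::-1]   # p_1 >= ... >= p_{n-1}


def longest_lifetime_k(p):
    """Eq. (6): k* = argmax over k = 2, ..., n-1 of (p_{k-1} - p_k)."""
    lifetimes = p[:-1] - p[1:]     # entry i holds p_{i+1} - p_{i+2}, i.e. k = i + 2
    return int(lifetimes.argmax()) + 2
```

On three tight, well-separated groups of points the two largest merge distances are nearly equal and much larger than the rest, so the largest gap falls between the second and third entries of p and the rule returns three clusters.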

In this paper, we propose to select the robust feature
representations according to the distance between the normalized
hierarchical-trees of two successive layers. Specifically, given the
trees of all layers $\{\mathbf{p}^{(l)}\}_{l=1}^{L}$, we first
normalize the tree of each layer by
$\bar{\mathbf{p}}^{(l)} = \mathbf{p}^{(l)} / p_1^{(l)}$, and then
calculate the distance between two successive normalized trees as:

$$
q_l = \left\| \bar{\mathbf{p}}^{(l)} - \bar{\mathbf{p}}^{(l-1)} \right\|_2^2, \quad l = 2, \ldots, L. \tag{7}
$$

Finally, we pick the output of the first layer that satisfies the
following inequality as the learned representation:

$$
\frac{q_{l^{*}}}{\max_{l = 2, \ldots, L} q_{l}} < \eta \tag{8}
$$

where $l^{*}$ represents the layer, and $\eta \in (0, 1)$ is a
user-defined constant. Note that this criterion is only an empirical
one. The monotonic decrease of $q_2, q_3, \ldots, q_L$ is not
guaranteed.

3.3 Effectiveness of the DRME Based Clustering: A 
Visualized Example 

Fig. 5. Experimental results of the DRME based clustering on the
Optdigits data set. (a) Curve of the detected cluster numbers. (b)
Accuracy curve. (c) Curve of the distances between the successive
normalized hierarchical-trees. The horizontal axis of the panels is
the number of layers.

Fig. 6. Similarity matrix comparison on the Optdigits data set. (a)
Target similarity matrix calculated from the ground truth labels.
(b) Similarity matrix on the original features. (c) Similarity
matrix on the learned feature representation at layer 9.

In this subsection, we run the DRME based clustering on the
Optdigits data set. The accuracy of the proposed clustering
algorithm is evaluated by comparing the predicted labels with the
ground-truth labels using normalized mutual information (NMI), which
was proposed in [12, Eq. (3)] and has become one of the standard
metrics for clustering.
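For reference, NMI between two labelings can be computed from their contingency table. The sketch below assumes the common geometric-mean normalization, I(A;B) / sqrt(H(A) H(B)), which is the form usually associated with [12]; returning 0 when one labeling has a single cluster is a convention chosen here, and the function name `nmi` is illustrative.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information I(A;B) / sqrt(H(A) H(B))."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)          # contingency table of co-occurrences
    joint /= a.size
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa * np.log(pa)).sum()
    hb = -(pb * np.log(pb)).sum()
    if ha == 0.0 or hb == 0.0:
        return 0.0                         # convention for a single-cluster labeling
    return float(mi / np.sqrt(ha * hb))
```

The score is invariant to how the cluster labels are named, so a clustering that matches the ground truth up to a relabeling scores 1, while a labeling that is statistically independent of the ground truth scores 0.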

The dendrograms produced by single-linkage are shown in Fig. 4. From
the figures, we can see clearly that the proposed DRME has a strong
denoising ability.

The experimental results are summarized in Fig. 5. From the figure,
we can observe that (i) the unstable variation of the number of
detected clusters does not affect the clustering accuracy much, (ii)
the clustering accuracy is mainly determined by the learned
representation, and (iii) rather than taking the number of detected
clusters as the representation selection criterion, using the
distance between the successive normalized hierarchical-trees as the
selection criterion is a good choice.

Note that k-means clustering provided with the true cluster number
can only achieve an NMI of 72.93%, while the DRME based clustering
can achieve 75.00%, which further demonstrates the power of the deep
representation learning algorithm.

4 Related Work 

In this section, we briefly present three related topics: clustering
ensemble, deep learning, and sparse coding.

4.1 Clustering Ensemble

Before reviewing clustering ensemble, we first review supervised
ensemble learning, in which clustering ensemble is rooted.
Ensemble learning aims to combine a group of diverse base learners
for better performance. The success of ensemble methods relies
heavily on the following two basic criteria [11].

• A meaningful selection of the base learner. The keyword
"meaningful" means that the base learner needs to be at least better
than a random guess.

• A strong diversity among the base learners. The keyword
"diversity" means that when the base learners make predictions on an
identical pattern, they differ from each other in terms of errors.

As presented in [11], there are generally four groups of ensemble
learning methods: methods that manipulate the training examples
[27], the input features [28], the training parameters [29], and the
output targets [30].

Clustering ensemble is an extension of ensemble learning to
unsupervised learning [12]-[15]. The key advantage of clustering
ensemble over a single clustering is that a group of clusterings is
capable of grasping the shape of highly variant data and has the
potential to avoid the bad local minima that most clustering
algorithms suffer from. Typically, a clustering ensemble is broken
into two components: (i) a group of clusterings that yield different
partitions, and (ii) a consensus function that aims to combine the
partitions (i.e. the base clusterings). Like its supervised
counterpart, clustering ensemble should satisfy the aforementioned
two key criteria, and can construct diverse base clusterings in the
aforementioned four ways. Currently, researchers focus on designing
the consensus function, which is a self-contained problem of
clustering ensemble. See [14] for an excellent review of the
consensus function.

But if we regard the group of different partitions as a new feature
representation of the original data, and if we regard the consensus
function as a clustering algorithm running on the new
representation, the clustering ensemble problem is reduced to a
single clustering problem applied to the output of a one-layer
(probably nonlinear) transform of the original data. Because, as
summarized in [23], [24], the clustering performance is mostly
decided by the shape (i.e. the feature representation) of the data
rather than by the clustering algorithm, it might be better for us
to pay more attention to the unsupervised feature representation
learning subproblem.

Clustering ensemble contributes to the key motivation of this paper.
First, it motivates us to view the representative deep learning
approaches as a stack of clustering ensembles. Second, the two basic
criteria of ensemble learning guide our design of the proposed DRME.
Third, the four types of diversity enhancement techniques inform the
implementation of our random model ensemble in each layer. Fourth,
the clustering ensemble problem provides us with a good testing
environment for checking whether the learned feature representation
can reveal the underlying nature of the data.



4.2 Deep Learning 

In [1], Bengio et al. conducted an excellent review of deep learning
and representation learning. Here, we briefly summarize the part of
its content that is related to this paper.

Existing deep learning approaches can be categorized into two
classes, which are rooted in probabilistic graphical models and
neural networks respectively [1, Section 5]. The main difference
between them is how to interpret the hidden units: as latent random
variables in probabilistic graphical models, or as computational
nodes in neural networks.

The representative method rooted in probabilistic graphical models is
the deep belief network (DBN) [2]. Its building block is the RBM, which
is a typical undirected graphical model that lies in the exponential
family. The main merit of RBM over the popular directed graphical models
is that the conditional distribution over the hidden units factorizes
given the visible units, and vice versa, so that most inferences are
readily tractable [1, Section 6.2.1]. The objective of RBM is to
maximize the likelihood of the input. It is solved by the stochastic
gradient descent algorithm. The detailed derivation of the algorithm can
be found in [3]. The main difficulty of the optimization is that the
expectation of the partition function (the normalization term) of the
probabilistic model is computationally intractable, so that expensive
Markov chain Monte Carlo (MCMC) sampling has to be used for this
function. Surprisingly, in [17], [18], Hinton found that it is
unnecessary to carry out the full MCMC; conducting only a few steps of
MCMC can also achieve good results. This biased approximation of
maximum likelihood learning is named contrastive divergence (CD)
learning. In [17], [18] and [1, Section 9.4], Hinton and Bengio et al.
have tried to explain why CD learning can provide a reasonable
approximation of maximum likelihood learning.
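The "few steps of MCMC" idea can be illustrated with a minimal NumPy sketch of one CD-1 parameter update for a Bernoulli-Bernoulli RBM. This is a generic textbook-style sketch, not the implementation used in the paper; the function name `cd1_update`, the visible bias `b`, the mini-batch averaging, and the default learning rate are assumptions made for illustration. Rows of `W` index hidden units.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(X, W, b, c, lr=0.005, rng=None):
    """One contrastive-divergence (CD-1) update on a mini-batch X of shape (n, d).

    W: (m, d) weights, b: (d,) visible bias, c: (m,) hidden bias.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    # Positive phase: hidden probabilities and samples given the data
    ph0 = sigmoid(X @ W.T + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(X.dtype)
    # A single Gibbs step instead of running the chain to convergence
    pv1 = sigmoid(h0 @ W + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(X.dtype)
    ph1 = sigmoid(v1 @ W.T + c)
    # Biased gradient estimate: data statistics minus one-step statistics
    W = W + lr * (ph0.T @ X - ph1.T @ v1) / n
    b = b + lr * (X - v1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```

Running the Gibbs chain for more steps before computing the negative statistics would move the estimate closer to the maximum likelihood gradient, at a higher cost.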

The representative method rooted in neural networks is the stacked
denoising autoencoder (SDAE) [6], [7]. Its building block is DAE, which
is a regularized autoencoder. Compared to the approaches based on
probabilistic graphical models, the main merit of DAE is that it not
only defines a simple tractable optimization objective that avoids
dealing with the complicated partition function, but also can take the
output of the autoencoder as the learned representation directly
[1, Section 7]. Compared to the non-regularized autoencoder, the main
merit of DAE is that it can learn over-complete representations, i.e.
representations whose dimension is larger than that of the input, while
avoiding the trivial solution of merely duplicating the input
[1, Section 7.2]. The objective of DAE is to minimize the
reconstruction error, which is formulated as the following optimization
problem:

$$\min_{\theta,\theta'} \sum_{i=1}^{n} \ell\left(x_i, h_{\theta'}(f_{\theta}(\tilde{x}_i))\right) \tag{9}$$

where $\tilde{x}$ is a noise-corrupted version of $x$, $f_{\theta}$ is
the encoder, $h_{\theta'}$ is the decoder that will be discarded in the
final network, and $\ell$ is the risk function. $\ell(x, y)$ can be
defined as the squared loss $\|x - y\|_2^2$ for unbounded real-valued
$x$ and $y$, or the binary cross-entropy loss
$-\sum_{t} \left[ y_t \log(x_t) + (1 - y_t) \log(1 - x_t) \right]$ for
$x$ and $y$ that are bounded in the range $[0, 1]$.
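The DAE objective can be evaluated with a short NumPy sketch. It is a sketch under stated assumptions rather than the paper's implementation: zero-masking noise, a sigmoid encoder, a linear decoder for the squared loss and a sigmoid decoder for the cross-entropy loss are all illustrative choices, the names `mask_inputs` and `dae_objective` are hypothetical, and the cross-entropy branch uses the conventional form with the clean input as the target.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mask_inputs(X, frac, rng):
    """Zero-masking noise: each entry is set to 0 with probability `frac`."""
    keep = rng.random(X.shape) >= frac
    return X * keep

def dae_objective(X, W_enc, c, W_dec, b, frac=0.5, loss="squared", rng=None):
    """Reconstruction error of a one-layer denoising autoencoder.

    X: (n, d) clean inputs, W_enc: (m, d), c: (m,), W_dec: (d, m), b: (d,).
    """
    rng = np.random.default_rng() if rng is None else rng
    X_tilde = mask_inputs(X, frac, rng)       # corrupted input
    Y = sigmoid(X_tilde @ W_enc.T + c)        # encoder output (the representation)
    if loss == "squared":
        R = Y @ W_dec.T + b                   # linear decoder (assumption)
        return float(np.sum((X - R) ** 2))
    R = sigmoid(Y @ W_dec.T + b)              # sigmoid decoder for inputs in [0, 1]
    eps = 1e-12                               # guards against log(0)
    return float(-np.sum(X * np.log(R + eps) + (1.0 - X) * np.log(1.0 - R + eps)))
```

Note that the loss compares the reconstruction with the clean input `X`, not with the corrupted `X_tilde`; this is what prevents the network from simply copying its input.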

The two representative deep learning approaches contribute to two ideas
of this paper. First, different from the above two branches, this paper
views the existing deep learning approaches from a new perspective,
namely as knowledge reuse algorithms of clustering ensemble, where the
building blocks, such as RBM and DAE, can be interpreted as clustering
ensemble approaches; see Section 5.1 for a detailed discussion. Second,
the interesting CD learning contributes partially to the idea of
reducing the k-means ensemble [13] to the random model ensemble; see
Section 5.3 for a detailed discussion.



4.3 Sparse Coding and Dictionary Learning 

Sparse coding is an important unsupervised representation learning
approach. It has been widely used in computer vision and image
processing, and is an active subfield of machine learning. Suppose we
are to learn a sparse representation of the input $x$, denoted as $y$.
The basic problem of sparse coding is formulated as the following
optimization problem:

$$\min_{\{y_i\}_{i=1}^{n},\, M} \sum_{i=1}^{n} \|x_i - M y_i\|_2^2 + \lambda \|y_i\|_1, \quad \text{subject to } \|M_{:,j}\|_2^2 = 1, \ \forall j, \tag{10}$$

where $M$ is a matrix variable with $d$ rows, called the dictionary or
the basis set, with $M_{:,j}$ representing the $j$-th element of the
dictionary, called the basis vector, and $\lambda$ is a user-defined
parameter. The sparsity of $y$ is enforced by the $\ell_1$-norm
penalty. Typically, the alternating optimization
method is adopted |^|. The method iterates the follow- 
ing two steps. The first step is to optimize y given fixed 
M, and the second step is to optimize M given fixed y. 
The first step can be viewed as an encoding method that 
can be studied and applied independently. The second 
step is also named dictionary learning. 
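The first (encoding) step can be illustrated with a minimal NumPy sketch for a single input and a fixed dictionary. The solver shown here, iterative soft-thresholding (ISTA), is one standard choice and is an assumption of this sketch; the text above does not prescribe a particular solver. The names `soft_threshold` and `ista_encode` and the iteration count are illustrative. The objective minimized is the squared reconstruction error plus `lam` times the l1-norm of the code.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (shrinks every entry toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_encode(x, M, lam, n_iter=200):
    """Approximately minimize ||x - M y||_2^2 + lam * ||y||_1 over y, with M fixed."""
    # Lipschitz constant of the gradient of the quadratic term
    L = 2.0 * np.linalg.norm(M, 2) ** 2
    y = np.zeros(M.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * M.T @ (M @ y - x)
        y = soft_threshold(y - grad / L, lam / L)
    return y
```

With an identity dictionary the problem decouples per coordinate, and the code is simply the input shrunk by `lam / 2`, which gives a quick sanity check of the solver.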

There are many sparse coding and dictionary learning algorithms. In
this paper, we pay particular attention to [16]. In [16], Coates and Ng
conducted a broad experimental comparison on sparse coding, and drew
two important experimental conclusions:

• "When using sparse coding as the encoder, virtually any training
algorithm can be used to create a suitable dictionary."

• "Regardless of the choice of dictionary, a very simple encoder can
often be competitive with sparse coding."

In this paper, we provide a reasonable explanation of the experimental
phenomena of [16] from the perspective of clustering ensemble, and also
view the proposed algorithm from the sparse coding perspective; see
Section 5.2 for a detailed discussion. The experimental phenomena on
sparse coding provide convincing evidence for the soundness of the
proposed DRME; see Section 5.3 for a detailed discussion.

5 Motivation 

In this section, we will first explain why we can use the random model
ensemble as the building block of deep learning by analyzing several
existing representation learning approaches in Sections 5.1, 5.2 and
5.3, and then analyze empirically the key elements that contribute to
the success of DRME in Section 5.3.



5.1 Viewing Deep Learning As A Framework of 
Knowledge Reuse of Clustering Ensemble 

As presented in Section 1, deep learning is a stack of shallow models,
where each shallow model takes the output of its preceding model as its
input. Hence, deep learning is a framework of knowledge reuse of
shallow models. In this subsection, we focus on discussing the
relationship between the representative shallow models (i.e. RBM and
DAE) and clustering ensemble.

5.1.1 Relationship Between Restricted Boltzmann Machine and Clustering
Ensemble

RBM is a probabilistic clustering ensemble in which each base
clustering is a binary-class probabilistic clustering. We present this
relationship in detail as follows.

The central problem of unsupervised representation learning is to model
complicated smooth distributions arbitrarily accurately. As summarized
in [17, Section 1], data modeling approaches fall into two classes: the
mixture model and the product of experts.

The first class is the mixture model. It aims to combine a large number
of tractable probabilistic models by forming a weighted mixture. The
general probability framework of this class is as follows:

$$p_{\theta_1,\ldots,\theta_k}(x) = \sum_{i=1}^{k} \pi_i\, p_{\theta_i}(x), \quad \text{subject to } \sum_{i=1}^{k} \pi_i = 1, \tag{11}$$

where $k \in \{1, 2, \ldots, +\infty\}$ is the number of the mixtures,
$\pi_i$ is the weight of the $i$-th individual model, and
$p_{\theta_i}$ is the $i$-th probabilistic model with $\theta_i$ as its
parameter. One typical model of this class is the Gaussian mixture
model (GMM). It is well-known that GMM is a probabilistic clustering
algorithm with each mixture component modeling a cluster. Another
typical model is the k-means clustering, which is a small-variance
asymptotic limit (i.e. a hard clustering version, or deterministic
version) of GMM [32, Chapter 9.3.2]. These models are easily optimized
via the EM algorithm. However, they are ineffective in modeling
posterior distributions that are sharper than the individual mixture
components.

The second class is the product of experts. It aims to combine multiple
individual models by multiplying them, where the individual models have
to be a bit more complicated and each contains one or more latent
variables. The general probability framework of this class is as
follows:

$$p_{\theta_1,\ldots,\theta_k}(x) = \frac{\prod_{i=1}^{k} f_{\theta_i}(x)}{\sum_{x'} \prod_{i=1}^{k} f_{\theta_i}(x')} \tag{12}$$

where $x'$ indexes all possible vectors in the data space, and
$f_{\theta_i}$ is called an expert [17]. One typical model of the
second class is the Bernoulli-Bernoulli RBM model. Its expert
$f_{\theta_i}$ is specified as:

$$f_{\theta_i}(x) = \sum_{h_i \in \{0,1\}} \exp\left(c_i h_i + h_i W_{i,:}\, x\right) \tag{13}$$

where $h_i \in \{0, 1\}$ is a binary hidden variable, and
$\theta_i = \{c_i, W_{i,:}\}$ with $W_{i,:}$ as the $i$-th row of the
parameter matrix $W$ of RBM and $c_i$ as the $i$-th bias term. See [3]
for a detailed derivation of Eq. (13). The difficulty of this class is
that the denominator of Eq. (12) is intractable, so that expensive MCMC
has to be used. If we roughly view (13) as a binary-class probabilistic
clustering, RBM is in fact a probabilistic clustering ensemble with
each base clustering shown in (13). More generally, regardless of the
difficulty of the parameter inference, we can substitute (11) into (12)
to obtain any complicated clustering ensemble model. Hence, it is not
surprising that the product of experts can achieve performance superior
to the mixture model in many applications, such as the RBM based speech
recognition system (without a deep structure) over the GMM based one
[33].
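As a short worked step, assume the expert takes the standard Bernoulli-Bernoulli RBM form with hidden bias $c_i$ and weight row $W_{i,:}$ (the visible bias is omitted here for brevity; this form is an assumption of the sketch). Summing out the binary hidden variable shows why each expert can be read as a soft two-way partition of the data space:

```latex
\begin{aligned}
f_{\theta_i}(x) &= \sum_{h_i \in \{0,1\}} \exp\big(h_i\,(c_i + W_{i,:}\,x)\big)
                 = 1 + \exp\big(c_i + W_{i,:}\,x\big), \\
P(h_i = 1 \mid x) &= \frac{\exp\big(c_i + W_{i,:}\,x\big)}{1 + \exp\big(c_i + W_{i,:}\,x\big)}
                   = \frac{1}{1 + \exp\big(-(c_i + W_{i,:}\,x)\big)}.
\end{aligned}
```

The sign of $c_i + W_{i,:}\,x$ decides which of the two "clusters" the observation $x$ more probably belongs to, and the sigmoid gives the corresponding soft membership.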

5.1.2 Relationship Between Denoising Autoencoder 
and Clustering Ensemble 

Both DAE and RBM belong to the class of product 
of experts, i.e. clustering ensembles. The difference 
between them is that DAE is a deterministic clustering 
ensemble while RBM is a stochastic one. This difference 
is analogous to the difference between GMM and k- 
means in the class of mixture model. 

Specifically, DAE is an ensemble of the following binary-class
deterministic clustering:

$$f_{\theta_i}(\tilde{x}) = \frac{1}{1 + \exp\left(-c_i - W_{i,:}\, \tilde{x}\right)} \tag{14}$$

where $\theta_i = \{c_i, W_{i,:}\}$ is the classification hyperplane of
the clustering, and $\tilde{x}$ is a random feature sampling of $x$.
Note that the random feature sampling is one of the most important
diversity enhancement techniques of
ensemble learning p8| . 

In summary, it is valid to use any clustering ensemble whose base
learner satisfies the two criteria in Section 4.1 as the building block
of a deep architecture. In this paper, we propose to use the framework
in [13] as the building block. This building block has the following
two properties:

• The base learner is a multi-class clustering that can partition data
into an arbitrary number of clusters.

• The output of the base learner is a 1-of-k sparse representation.

Besides, we may take randomly selected features as the input of the
base learner, as [15] and DAE did, though we have observed no obvious
performance improvement on the experimental data sets when adopting
this scheme.

5.2 Viewing Sparse Coding As the Consensus Function of Clustering
Ensemble

In this subsection, we will focus on analyzing the two interesting
experimental phenomena of [16], which are summarized in Section 4.3,
from the perspective of clustering ensemble. The main conclusion of
this analysis is that when sparse coding is used as the encoder, it is
equivalent to the consensus function of clustering ensemble, and
dictionary learning is equivalent to the training process of the base
learners.

Specifically, given a learned dictionary $M$ whose entries lie in
$[-1, 1]$, sparse coding aims to solve the following optimization
problem:

$$\min_{\{y_i\}_{i=1}^{n}} \sum_{i=1}^{n} \|x_i - M y_i\|_2^2 + \lambda \|y_i\|_1 \tag{15}$$

It is known that the parameter $\lambda$ controls the sparsity. Here,
we view it in a different way: as a parameter that controls the number
of the base learners of the clustering ensemble. Specifically, if we
set $\lambda = 0$, it is likely that $y_i$ contains only one non-zero
element. That is to say, we group all basis vectors into a single
multi-class clustering, which is obviously a weak consensus function. A
good value of $\lambda$ is one that makes a small part of the elements
non-zero. This choice is equivalent to partitioning the dictionary into
several (probably overlapped) subsets and then grouping the basis
vectors in each subset into a base clustering. From this point of view,
as long as the base learners satisfy the two basic criteria, no matter
how weak the base learners are and what kind of sparse coding is used,
the performance of sparse coding (as an encoder) is guaranteed. This
accounts for the experimental phenomena of [16].

In fact, the random model ensemble in the proposed DRME is one of the
simplest sparse coding methods, and is implemented in a way similar to
what we have analyzed above. Specifically, we first randomly sample
multiple observations (for example, 500) to form a dictionary; then, we
randomly partition the dictionary into a series of highly overlapped
subsets (for example, 2000 subsets), with each subset containing an
arbitrary number of basis vectors within a given range (for example,
[10, 100]); finally, we regard each subset as a base clustering and
apply the 1-of-k coding, which is the simplest sparse coding method, to
each base clustering. Note that the use of overlapped subsets can be
explained by the explaining away property of sparse coding; see
[1, Sections 6.1.1 and 6.1.3] for a detailed analysis.
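The three steps above can be sketched in NumPy as follows. This is a minimal illustration written for this text, not the released DRME implementation: the function names `rme_layer` and `drme`, the use of the Euclidean distance for the nearest-center assignment, and the clipping of the subset sizes to the dictionary size are assumptions. The stacking function simply feeds the output of one layer into the next.

```python
import numpy as np

def rme_layer(X, n_dict=500, n_base=2000, k_range=(10, 100), rng=None):
    """One random model ensemble layer: returns concatenated 1-of-k codes."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    n_dict = min(n_dict, n)
    # Step 1: the dictionary is a set of randomly sampled observations
    D = X[rng.choice(n, size=n_dict, replace=False)]
    # Squared Euclidean distance from every observation to every basis vector
    d2 = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ D.T + (D ** 2).sum(axis=1)[None, :]
    k_lo = min(k_range[0], n_dict)
    k_hi = min(k_range[1], n_dict)
    rows = np.arange(n)
    blocks = []
    for _ in range(n_base):
        # Step 2: a random subset of the dictionary (subsets may overlap)
        k = int(rng.integers(k_lo, k_hi + 1))
        subset = rng.choice(n_dict, size=k, replace=False)
        # Step 3: 1-of-k coding, each observation activates its nearest center
        nearest = d2[:, subset].argmin(axis=1)
        onehot = np.zeros((n, k), dtype=np.uint8)
        onehot[rows, nearest] = 1
        blocks.append(onehot)
    return np.hstack(blocks)

def drme(X, depth=3, **layer_kwargs):
    """Stack `depth` random model ensemble layers (no parameter optimization)."""
    H = X
    for _ in range(depth):
        H = rme_layer(H, **layer_kwargs)
    return H
```

With the example values from the text (2000 subsets of 10 to 100 basis vectors), the output has on the order of 10^5 binary columns, so a sparse matrix format would be a natural refinement for larger data sets.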

Moreover, we can also explain two important experimental phenomena of
[16], which are not emphasized in [16], from the perspective of
clustering ensemble. (i) The only failed dictionary in [16] is the one
that consists of completely random weights. The reason is that this
dictionary does not satisfy the first basic criterion of ensemble
learning. In other words, the base learners are too weak to be
meaningful. (ii) The dictionary that is filled via random sampling from
the input observations, which is a scheme that coincides with the
proposed DRME, can achieve state-of-the-art performance. This is
because random sampling is not completely random: it reflects the
distribution information of the input, and hence is probably the
weakest meaningful base learner we can find.

Fig. 7. Results of the reduction of the EM iterations on the Optdigits
data set. "DKME" is short for deep k-means ensemble, "maxj" represents
the maximum iteration number of the base k-means.

Fig. 8. Similarity matrices of the first three layers of DRME on the
Optdigits data set with the base clustering generated from the random
observation sampling. (a) Similarity matrix of feature at layer 1.
(b) Similarity matrix of feature at layer 2. (c) Similarity matrix of
feature at layer 3.

Fig. 9. Similarity matrices of the first three layers of DRME on the
Optdigits data set with the base clustering generated from the
completely random centers.

5.3 Replacing Clustering Ensemble With Random 
Model Ensemble 



As mentioned in Section 5.1, we have adopted the framework of the
k-means based clustering ensemble in [13]. After viewing RBM, DAE and
sparse coding as special cases of clustering ensemble and revisiting
the basic criteria of clustering ensemble, we are ready to reduce the
k-means based clustering ensemble that is trained with the full EM
training to one that is trained with only one or even zero EM
iterations. This reduction is mainly motivated by the biased
approximation of CD learning to maximum likelihood learning for DBN
[17], [18], and is further supported by the success of the random
sampling based dictionary learning when sparse coding is used as the
encoder [16].

Fig. 7 illustrates the effectiveness of this reduction on Optdigits. In
this example, we use a subset of Optdigits that consists of only 2000
observations, and use only 200 base clusterings per layer. From the
figure, we can observe that (i) even under this light experimental
setting, the deep k-means ensemble (DKME) is still very inefficient,
even with only one EM iteration, while DRME is about two orders of
magnitude faster than the DKME with only one EM iteration; (ii)
although DKME can reach good representations with fewer layers than
DRME, both of them can reach equivalently good representations in very
deep layers.

Here comes the question: can we use completely random centers instead
of centers that are randomly sampled from the data? No. From Fig. 9, we
can see that when we use completely random centers, the similarity
matrices are quite confused, while from Fig. 8, we can observe that
when we use the random observation sampling, the similarity matrices
become clearer as the depth increases.

In conclusion, the random observation sampling is empirically a
meaningful base clustering for the deep clustering ensemble.

6 Experiments 

In this section, we will compare the proposed DRME algorithm with 5
referenced representation learning algorithms on 19 UCI benchmark data
sets. All experiments are conducted with MATLAB 7.12 on a 2.27 GHz
8-core Intel(R) Xeon(R) server running Windows XP with 16 GB memory.
The implementation of DRME can be downloaded from http://XXXXX.

TABLE 2
Descriptions of the data sets. The data sets that are marked with * are
the randomly sampled subsets from the original data sets.

ID   Data                Size (n)   Feature (d)   Class (k)
1    Dermatology         366        34            6
2    Iris                150        4             3
3    Ecoli               336        4             3
4    Wine                178        7             8
5    Glass               214        9             6
6    New-Thyroid         215        5             3
7    Vowel               990        10            11
8    Balance             625        4             3
9    Yeast               1484       8             10
10   Satimage*           2000       36            6
11   Letter*             2000       16            26
12   Pendigits*          2000       16            10
13   Segmentation*       2000       19            7
14   Optdigits*          2000       64            10
15   Shuttle*            2000       9             5
16   Vehicle             846        18            4
17   Fea*                2000       87            5
18   Libras              168        90            7
19   Synthetic-Control   600        60            6

6.1 Experimental Settings and Comparison 
Schemes 

The experiments are performed on 19 UCI data sets². In this paper, we
conduct 20 independent runs on each data set and report the average
results. For the original UCI data sets that contain more than 2000
observations, we randomly sample 2000 observations 20 times and conduct
each independent run on a different sampling. The detailed information
of the data sets is listed in Table 2.

For the proposed DRME, the depth is set to 15. The size of the
dictionary is set to 500. The number of the base clusterings in each
layer is set to 2000. The minimal number of clusters that a base
clustering can produce, i.e. $k_{\min}$, is set to 10. When the size of
the data set is smaller than 500, the maximal number of clusters that a
base clustering can produce, i.e. $k_{\max}$, is set to 30; otherwise,
$k_{\max}$ is set to 100.

To examine the effectiveness of the proposed DRME, 
we compare DRME with the following 5 representation 
learning methods. 

1) Shallow representation learning methods.

• k-means based clustering ensemble (KMCE) [13]. The number of the base
clusterings is set to 2000. According to [13], $k_{\min}$ is set to 10
and $k_{\max}$ is set to 30.

• Random model ensemble (RME). This is the DRME with a depth of only 1
layer. The other parameters are set to the same values as the proposed
DRME.

• Principal component analysis (PCA). The kernel PCA [34] toolbox³ is
used with the kernel type set to the linear kernel. The eigenvectors
corresponding to the 100 largest eigenvalues are preserved.

2) Deep representation learning methods⁴.

• Deep belief networks (DBN) [2]. The depth is set to 5. The number of
the hidden units in each layer is set to 200. The learning rate is set
to 0.005. The momentum is set to 0.9. The number of epochs for the
unsupervised training is set to 120. The batch size of observations is
set to 1.

• Stacked denoising autoencoder (SDAE) [6], [7]. The depth is set to 5.
The number of the hidden units in each layer is set to 200. The
learning rate is set to 0.005. The fraction of the zero-masked inputs
is set to 0.5.

2. http://archive.ics.uci.edu/ml

Note that we set the depth to 5 rather than 15 (as we did in DRME)
because we found that the performance of DBN and SDAE drops
significantly when the network is extremely deep.

For each representation learning method, the effectiveness of the
learned representation is evaluated by the accuracy and the number of
clusters yielded by the single-linkage clustering, where the accuracy
is measured by NMI [12, Eq. (3)]. For the DRME-based single-linkage
clustering, the parameter $\eta$ is set to 0.005. We will also report
the highest NMI that the DRME-based single-linkage can achieve among
the layers. The corresponding ideal method is denoted as iDRME. For the
DBN-based and SDAE-based single-linkages, we pick the highest NMIs they
can achieve among all 5 layers. Note that this comparison scheme is
unfair to our DRME, but we still compare in this way.
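For readers who want to reproduce the accuracy measure, the following NumPy sketch computes a normalized mutual information between two labelings. It assumes the common normalization by the geometric mean of the two entropies; whether this matches [12, Eq. (3)] term by term is an assumption of this sketch, and the function name `nmi` is illustrative.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information: I(A; B) / sqrt(H(A) * H(B))."""
    _, a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)
    _, b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)
    n = a.size
    # Contingency table of joint cluster memberships
    cont = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(cont, (a, b), 1.0)
    p_ab = cont / n
    p_a = p_ab.sum(axis=1)
    p_b = p_ab.sum(axis=0)
    nz = p_ab > 0
    mi = np.sum(p_ab[nz] * np.log(p_ab[nz] / np.outer(p_a, p_b)[nz]))
    h_a = -np.sum(p_a * np.log(p_a))
    h_b = -np.sum(p_b * np.log(p_b))
    if h_a == 0.0 or h_b == 0.0:
        # A labeling with a single cluster carries no information
        return 0.0
    return float(mi / np.sqrt(h_a * h_b))
```

The measure is invariant to a renaming of the cluster labels, which is what makes it suitable for comparing a clustering result with the ground-truth classes.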

Besides the aforementioned representation learning methods, we will
further provide the performance of the k-means⁵ provided with the true
number of clusters. The corresponding method is denoted as KM*.

For all 6 representation learning methods, only the CPU time that is
consumed on learning the representations is recorded.

6.2 Results 

In this subsection, we will compare the clustering accuracy in terms of
NMI, the detected number of clusters, and the CPU time, respectively.

3. The implementation code is in the SVM-KM toolbox:
http://asi.insa-rouen.fr/enseignants/~arakotom/toolbox/index.html

4. The deep learning toolbox is downloaded from
https://github.com/rasmusbergpalm/DeepLearnToolbox

5. The implementation code is in VOICEBOX, a speech processing toolbox
developed at Imperial College London. It can be downloaded from
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/doc/voicebox/kmeans.html

TABLE 3
NMI (in percentage) comparison. The digit in brackets is the standard
deviation. The digit in italic and red color means that the
corresponding method is over-fitting to the data set, hence it is
meaningless. The digit in bold means that the corresponding method
achieves the highest NMI on the data set. We test for confidence
interval at 95% with the two-tailed t test. The column of "KM*" lists
the NMIs of the k-means provided with the true number of clusters. The
column of "iDRME" lists the highest NMIs that the DRME can achieve.
These two columns will not join in the comparison.

ID   Data                KM*            KMCE            RME             PCA             DBN             SDAE            DRME            iDRME
1    Dermatology         83.87 (4.98)   93.68 (0.00)    54.23 (0.00)    54.23 (0.00)    83.06 (3.05)    54.93 (5.35)    82.57 (5.66)    86.61 (2.44)
2    Iris                71.76 (4.41)   65.70 (0.00)    76.12 (0.00)    76.12 (0.00)    70.81 (11.34)   76.12 (0.00)    68.62 (3.29)    76.12 (0.00)
3    Ecoli               59.16 (2.35)   29.98 (7.69)    55.86 (0.15)    20.57 (0.00)    32.03 (9.09)    21.40 (0.84)    54.62 (3.06)    61.75 (3.04)
4    Wine                83.08 (1.75)   49.56 (4.72)    67.50 (12.70)   2.67 (0.00)     23.48 (10.73)   13.86 (16.13)   66.60 (9.14)    73.69 (3.94)
5    Glass               32.33 (3.95)   27.71 (9.59)    39.97 (9.13)    8.98 (0.00)     18.20 (9.69)    10.31 (1.70)    37.35 (1.62)    43.51 (6.62)
6    New-Thyroid         60.27 (0.00)   27.63 (0.00)    17.41 (19.89)   8.93 (0.00)     18.27 (8.94)    14.94 (11.05)   47.56 (1.72)    51.89 (3.39)
7    Vowel               38.13 (2.44)   10.53 (0.00)    7.44 (3.30)     13.64 (0.00)    12.81 (9.82)    12.77 (2.12)    39.22 (10.29)   46.50 (3.30)
8    Balance             11.76 (6.63)   29.45 (0.00)    20.35 (17.30)   3.92 (0.00)     4.39 (0.76)     37.67 (0.02)    13.58 (4.82)    33.24 (4.71)
9    Yeast               27.68 (1.00)   11.45 (0.00)    26.62 (22.26)   6.34 (1.17)     9.31 (2.35)     7.72 (1.10)     22.93 (6.55)    38.57 (10.60)
10   Satimage*           61.79 (0.80)   13.63 (15.72)   3.88 (10.03)    1.96 (0.77)     31.92 (7.90)    8.43 (11.95)    48.76 (11.34)   58.94 (1.55)
11   Letter*             38.66 (1.55)   10.12 (0.82)    15.40 (21.39)   33.65 (31.83)   20.13 (23.50)   55.89 (23.42)   17.74 (7.50)    41.38 (10.61)
12   Pendigits*          67.64 (2.14)   22.83 (9.11)    15.00 (18.58)   2.66 (0.99)     33.67 (7.53)    3.98 (1.86)     60.74 (15.94)   75.76 (2.30)
13   Segmentation*       61.54 (1.61)   63.26 (0.32)    63.18 (0.69)    49.92 (8.05)    45.89 (0.35)    49.82 (23.90)   60.44 (10.40)   66.14 (1.73)
14   Optdigits*          73.17 (2.85)   40.16 (7.68)    18.27 (18.34)   1.56 (0.48)     20.60 (18.90)   31.15 (27.09)   67.34 (15.07)   82.04 (2.09)
15   Shuttle*            37.02 (3.94)   46.26 (10.69)   45.65 (13.26)   29.44 (25.38)   8.30 (9.49)     30.49 (21.39)   37.11 (8.24)    56.65 (8.17)
16   Vehicle             10.98 (2.06)   10.13 (0.00)    12.55 (16.48)   1.43 (0.00)     14.23 (3.12)    6.58 (9.13)     18.63 (3.65)    29.02 (8.46)
17   Fea*                15.81 (8.67)   5.73 (7.18)     6.47 (3.84)     4.16 (2.61)     7.80 (11.31)    39.22 (0.46)    12.47 (10.07)   28.90 (3.05)
18   Libras              48.22 (4.55)   13.66 (0.00)    13.66 (0.00)    4.39 (0.00)     21.33 (12.97)   31.12 (23.31)   60.62 (4.66)    66.39 (0.73)
19   Synthetic-Control   72.78 (2.42)   82.71 (0.00)    50.15 (0.00)    50.15 (0.00)    65.02 (8.95)    50.15 (0.00)    64.34 (20.59)   82.23 (1.64)

Tables 3 and 4 list the clustering accuracy and the detected number of
clusters respectively. The two tables should be considered jointly.
Before our formal analysis, we have to note that when the detected
number of clusters is very high, the clustering accuracy is generally
high as well; however, this is an illusion, since the single-linkage
fails to detect useful clusters. This is mostly caused by a roughly
learned representation. Therefore, in our comparison, when the detected
numbers of clusters in Table 4 are very high, we do not consider the
corresponding results in either Table 4 or Table 3. From the two
tables, we can observe the following experimental phenomena. (i) The
proposed DRME achieves the highest NMIs and detects the true numbers of
clusters on most of the 19 data sets. (ii) Generally, DRME achieves an
accuracy as high as KM*, and moreover, iDRME is even better than KM*.
(iii) The proposed representation selection scheme works quite well,
while the referenced methods, including iDRME, suffer more or less from
the under-fitting problem. (iv) The referenced deep learning approaches
do not achieve the expected performance. One reason might be that the
parameters are not well-tuned. However, we have no way to tune the
parameters in the real-world unsupervised learning scenario; hence,
only empirically workable settings are adopted. Another reason is that
the data sets are so small that they cannot meet the requirement of the
parameter training. From this point of view, the proposed DRME can
handle more general representation learning tasks.

Table [4] lists the CPU time comparison. From the 
table, we can see that the proposed DRME is quite 
efficient when compared with KMCE and the two deep 
learning approaches. This phenomenon demonstrates one
significant merit of discarding the EM training of the 
base clustering in DRME. 

7 Conclusions 

In this paper, we have viewed several representative unsupervised
representation learning algorithms as special cases of clustering
ensemble. Based on this novel view, we have proposed a new deep
clustering ensemble algorithm, named deep random model ensemble. In
order to find the most powerful representation among the layers, we
have further applied DRME to clustering. Specifically, (i) we have
viewed the deep belief networks as a stack of probabilistic clustering
ensembles, where each base clustering of the clustering ensemble is a
binary-class probabilistic clustering. We have viewed the stacked
denoising autoencoder as a stack of deterministic clustering ensembles,
where each base clustering is a binary-class deterministic clustering.
Moreover, when sparse coding is used as the feature encoder, we have
viewed this usage of sparse coding as a kind of consensus function of
clustering ensemble, and viewed dictionary learning as the training
process of the base clusterings of clustering ensemble.
TABLE 4
Comparison of the detected number of clusters. The digit in brackets is
the standard deviation. The digit in italic and red color means that
the corresponding method is over-fitting to the data set, hence it is
meaningless. The digit in bold means that the detected number of
clusters is the closest one to the true number of clusters on the data
set. The column of "True number" lists the true numbers of clusters.
The column of "iDRME" lists the numbers of the detected clusters of the
DRME that achieves the highest NMIs in Table 3. These two columns will
not join in the comparison.

ID   Data                True number   KMCE           RME               PCA               DBN               SDAE                DRME           iDRME
1    Dermatology         6             5.00 (0.00)    2.00 (0.00)       2.00 (0.00)       4.95 (1.39)       2.70 (0.57)         5.30 (1.72)    5.00 (1.97)
2    Iris                3             3.00 (0.00)    2.00 (0.00)       2.00 (0.00)       7.25 (5.65)       2.00 (0.00)         6.45 (1.43)    2.00 (0.00)
3    Ecoli               3             2.60 (0.49)    4.90 (0.30)       2.00 (0.00)       8.70 (4.65)       3.15 (0.37)         7.95 (3.12)    3.85 (0.59)
4    Wine                8             3.65 (0.73)    52.85 (71.46)     2.00 (0.00)       13.65 (23.45)     15.85 (38.06)       6.10 (1.07)    9.65 (3.44)
5    Glass               6             2.80 (1.25)    68.70 (94.61)     3.00 (0.00)       6.25 (7.31)       3.50 (1.40)         8.00 (1.08)    72.75 (94.36)
6    New-Thyroid         3             3.00 (0.00)    2.60 (0.97)       2.00 (0.00)       5.10 (3.95)       3.85 (3.36)         8.90 (2.05)    3.75 (1.94)
7    Vowel               11            2.00 (0.00)    2.95 (1.32)       6.00 (0.00)       49.70 (121.38)    7.85 (2.32)         15.20 (8.40)   23.60 (7.59)
8    Balance             3             81.00 (0.00)   312.75 (310.05)   2.00 (0.00)       5.00 (12.25)      623.60 (0.68)       6.95 (4.56)    348.70 (281.64)
9    Yeast               10            3.00 (0.00)    727.65 (725.25)   2.05 (0.22)       5.80 (8.21)       4.95 (1.57)         10.45 (6.68)   732.40 (739.23)
10   Satimage*           6             4.10 (2.55)    102.95 (434.76)   3.25 (1.67)       16.80 (40.39)     7.95 (3.89)         9.35 (5.25)    13.05 (3.12)
11   Letter*             26            2.30 (0.90)    298.20 (703.84)   990.40 (988.00)   402.25 (808.63)   1681.65 (723.79)    4.35 (3.63)    316.30 (714.33)
12   Pendigits*          10            2.85 (1.11)    4.30 (3.48)       3.30 (1.38)       18.85 (39.96)     4.70 (3.20)         10.30 (6.43)   11.90 (2.43)
13   Segmentation*       7             3.05 (0.22)    3.05 (0.22)       2.85 (1.49)       2.05 (0.22)       7.85 (4.07)         11.05 (4.67)   12.70 (2.75)
14   Optdigits*          10            3.30 (3.98)    3.60 (1.59)       2.75 (0.99)       8.00 (18.07)      1100.90 (1018.06)   9.70 (6.28)    11.50 (0.83)
15   Shuttle*            5             2.85 (0.85)    3.15 (1.35)       3.60 (1.28)       26.95 (75.01)     4.65 (2.41)         13.70 (5.96)   4.50 (3.22)
16   Vehicle             4             2.00 (0.00)    171.05 (336.85)   2.00 (0.00)       11.95 (22.94)     48.15 (187.57)      8.35 (3.69)    186.20 (337.87)
17   Fea*                5             2.60 (0.86)    3.05 (2.42)       11.80 (8.52)      64.40 (186.86)    612.45 (14.32)      6.15 (6.38)    22.45 (5.19)
18   Libras              7             2.00 (0.00)    2.00 (0.00)       2.00 (0.00)       4.40 (2.95)       57.55 (76.38)       6.90 (0.97)    12.55 (3.05)
19   Synthetic-Control   6             5.00 (0.00)    2.00 (0.00)       2.00 (0.00)       2.35 (0.49)       2.00 (0.00)         11.25 (6.19)   5.10 (0.79)

(ii) Inspired by the above novel views, we might use any valid
clustering ensemble to build a deep model, where the word valid means
that the base clusterings should be better than random guess and be
diverse from each other. Inspired by the success of the biased
approximation of contrastive divergence learning to maximum likelihood
learning for the deep belief networks, we have proposed DRME. It is a
reduction of the stacked k-means ensemble to the stacked random model
ensemble. A special point of the random model ensemble is that the k
centers of its base clustering are k observations randomly sampled from
the input observations, rather than completely random ones, which
accounts for the meaningfulness of the base clustering. (iii) To
prevent both under-fitting and over-fitting of the learned
representation to the data, we have proposed the DRME based
single-linkage clustering, where the most powerful representation is
selected as the first layer at which the hierarchical tree of the
single-linkage becomes stable. (iv) As a by-product, the DRME based
clustering also contributes to one basic problem of clustering:
detecting the natural clusters. We have conducted an extensive
experiment. The experimental results have shown that the proposed DRME
is more powerful than 5 state-of-the-art representation learning
algorithms in terms of clustering accuracy, and moreover, it is even
more powerful than the k-means clustering provided with the true number
of clusters.